March 22, 2005

Why multivariate analysis?

Landscape of tools

What we will cover today

Function for plotting

The iris dataset

head(iris, 4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

From left to right: I. setosa, I. versicolor, I. virginica.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA)

PCA is a tool that compresses the information in a data table into a more manageable space, the PC space.

The PC space is a new coordinate system whose axes (the principal components) point in the directions of largest variation in the data.

PCA principles

# scale and center table
d <- as.data.frame(scale(iris[, 1:4], center = TRUE, scale = TRUE))

# covariance matrix
covariance_matrix <- cov(d)

# eigenvalues
lambdas <- eigen(covariance_matrix)$values
importance_principal_components <- lambdas / sum(lambdas)

# eigenvectors (the loadings); eigen() returns them scaled to unit length
principal_components <- eigen(covariance_matrix)$vectors
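As a sanity check, the hand-rolled eigendecomposition should agree with R's built-in `prcomp()`: the squared standard deviations it reports are exactly the eigenvalues of the covariance matrix of the scaled data. A minimal sketch, recomputing the objects so the chunk is self-contained:

```r
# PCA on the scaled iris measurements, two ways
d <- scale(iris[, 1:4], center = TRUE, scale = TRUE)
lambdas <- eigen(cov(d))$values

pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

# prcomp's standard deviations, squared, are the eigenvalues of cov(d)
all.equal(lambdas, pca$sdev^2)
```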

PCA in R

pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)
summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
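The scores in `pca$x` are nothing more than the centred-and-scaled data projected onto the loadings in `pca$rotation`; a quick sketch reproducing them by hand:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

# reproduce the scores by hand: scaled data %*% loadings
scores <- scale(iris[, 1:4]) %*% pca$rotation
all.equal(unname(scores), unname(pca$x))
```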

PCA biplot

biplot(pca, xlabs = rep("+", nrow(iris)))

Petals and Sepals

Petal.Length and Petal.Width are highly collinear: their arrows point in nearly the same direction.

Sepal.Length and Sepal.Width are nearly uncorrelated: their arrows are close to orthogonal.
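What the biplot suggests can be checked directly with `cor()`:

```r
# petal dimensions are strongly correlated...
cor(iris$Petal.Length, iris$Petal.Width)   # about 0.96

# ...while the sepal dimensions are only weakly (negatively) correlated
cor(iris$Sepal.Length, iris$Sepal.Width)   # about -0.12
```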

A color PCA biplot

plot(pca$x[, 1:2], col = color_for_species(iris$Species))

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA), also known as metric Multi-Dimensional Scaling (mMDS), is similar to PCA, but it operates on a distance matrix instead of the raw data.

For example, the Euclidean distance between observations \(i\) and \(j\): \(D_{ij} = \sqrt{\sum_k{(x_{ik} - x_{jk}) ^ 2}}\)

PCoA arranges the data on a plot so that the distances between the plotted points are proportional to the distances between the original observations.

Principal Coordinate Analysis (PCoA)

\(D_{ij} = \sqrt{\sum_k{(x_{ik} - x_{jk}) ^ 2}}\)

head(iris, 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
# calculating by hand
sqrt(sum( (iris[1, 1:4] - iris[2, 1:4])^2 ))
[1] 0.5385165
distance <- dist(iris[, 1:4])
as.matrix(distance)[1:3, 1:3]
          1         2        3
1 0.0000000 0.5385165 0.509902
2 0.5385165 0.0000000 0.300000
3 0.5099020 0.3000000 0.000000

Principal Coordinate Analysis (PCoA)

image(as.matrix(distance), col = hcl.colors(100, "Zissou1"))

PCoA in R

pcoa <- cmdscale(distance, k = 10, eig = TRUE)
barplot(pcoa$eig[1:4] / sum(pcoa$eig))
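For Euclidean distances, PCoA recovers the same configuration as PCA on the (centred, unscaled) raw data, up to sign flips of the axes. A quick check, using distinct variable names so it does not clobber the objects above:

```r
distance <- dist(iris[, 1:4])        # Euclidean by default
pcoa_xy <- cmdscale(distance, k = 2)

# PCA on the same data, centred but not scaled (to match the distances)
pca_raw <- prcomp(iris[, 1:4], center = TRUE, scale = FALSE)

# coordinates agree axis by axis, up to sign
all.equal(abs(unname(pcoa_xy)), abs(unname(pca_raw$x[, 1:2])), tolerance = 1e-6)
```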

PCoA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))

PCoA vs PCA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))
plot(pca$x[, 1:2], col = color_for_species(iris$Species))

PCoA vs PCA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))
plot(pca$x[, 1], -pca$x[, 2], col = color_for_species(iris$Species))

non-metric Multi-Dimensional Scaling (nMDS)

nMDS

PCoA arranges the data on a plot so that the distances between the plotted points are proportional to the distances between the original observations.

Sometimes this is not possible in a metric space.

Non-metric Multi-Dimensional Scaling (nMDS) extends PCoA: it only tries to preserve the rank order of the distances, which often gives a better representation of the points on a plot.

nMDS in R

nMDS <- MASS::isoMDS(distance + 1e-9)  # distances cannot be zero
initial  value 3.025865 
iter   5 value 2.637651
final  value 2.582478 
converged
plot(nMDS$points, col = color_for_species(iris$Species))
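Not shown in the slides, but a useful diagnostic: a Shepard diagram compares the original dissimilarities with the distances in the nMDS configuration, with the fitted monotone regression overlaid (this recomputes `nMDS` so the chunk is self-contained):

```r
library(MASS)

distance <- dist(iris[, 1:4])
nMDS <- isoMDS(distance + 1e-9)  # distances cannot be zero

# observed dissimilarities vs. distances in the nMDS configuration
shep <- Shepard(distance + 1e-9, nMDS$points)
plot(shep, pch = ".", xlab = "Dissimilarity", ylab = "Ordination distance")
lines(shep$x, shep$yf, type = "S", col = "red")  # fitted monotone step function
```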

nMDS vs PCoA

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
plot(pcoa, col = color_for_species(iris$Species))
plot(nMDS$points, col = color_for_species(iris$Species))

K-means clustering

K-means clustering: principles

Partition \(n\) observations into \(k\) clusters.

clustering <- kmeans(pca$x[, 1:2], centers = 3)  # centers is the number of clusters
plot(
  pca$x[, 1:2],
  col = as.factor(clustering$cluster),
  pch = as.integer(iris$Species)  # Species is a factor; its integer codes pick the symbols
)
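`kmeans()` needs the number of clusters up front. A common heuristic, not covered in the slides, is the elbow plot: run k-means for a range of k and look for the bend in the total within-cluster sum of squares:

```r
set.seed(1)  # k-means starts from random centers
pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

# total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, \(k) kmeans(pca$x[, 1:2], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k (number of clusters)",
     ylab = "Total within-cluster SS")
```

For iris, the curve flattens around k = 3, matching the three species.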

Take-home messages

  • When you have multiple response variables or too many dimensions, you need multivariate analysis.
  • Multivariate analysis compresses information so that it is easier to work with.
  • If you have raw values, use PCA.
  • If you have distances or dissimilarities, use PCoA or nMDS.
  • K-means clustering is unsupervised machine learning: it finds groups without using the labels.